Date created: 7.04.2020
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
Question: We use to hear the term “exponential growth” all the time. But the true data scientist must object - what is the typical growth model for virus pandemics?
Exponential growth is a mathematically convenient model for the early stages of an outbreak of an infectious disease. Existing methods for producing Epidemic forecasts often make an implicit assumption that growth is exponential, at least while susceptible depletion remains negligible. However, empirical studies suggest that many infectious disease outbreaks display sub-exponential or polynomical growth early in the epidemic.
To understand the CoVID-19 pandemic curve, let’s evaluate time series data of confirmed cases based on statistical approach.
First, let’s assume, that CoVID-19 Pandemic Curve has an Exponential growth.
What is Exponential growth? It is a specific way that a quantity increases over time.
The formula that illustrates the concept of exponential growth is as follows:
$$n_t = n_0*b^t$$In case of modeling CoVID-19 Pandemic, it would show the number of Coronavirus cases at any given time $t$, $n_0$ is the number of cases at initial moment of time, $b$ is a growth factor, showing the number of people infected by each sick person.
If we model a situation where initially there is 1 infected person and each sick person infects $2$ other people ($b = 2$), we’ll end up having 1 billion infected people in just 1 month:
x = []
y = []
for i in range(31):
x.append(i)
y.append(2**i)
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Number of cases in 2 weeks", "Number of cases in a month"))
fig.add_scatter(x=x[:10], y=y[:10], line=dict(color='red', width=2), row=1, col=1)
fig.add_scatter(x=x, y=y, line=dict(color='blue', width=2), row=1, col=2)
fig.update_layout(title_text="Graph of Exponential Growth with Growth Factor 2",
title_font_size=30,
xaxis_title="Days",
yaxis_title='''Number of Cases (Growth Factor 2)''')
fig.update(layout_showlegend=False)
#fig.show()
Let’s compare the shape of these graphs to the real curves of Total Coronavirus Cases and Deaths. The data used for the next plot were collected from European Centre for Disease Prevention and Control website https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide. It shows daily number of cases and deaths per country:
df = pd.read_excel('COVID-19-geographic-disbtribution-worldwide-2020-04-09.xlsx')
#world_stat.Date = pd.to_datetime(world_stat.Date)
df.head()
First of all, let's group the data by date:
total_cases = df.groupby("dateRep")[["cases","deaths"]].sum()
total_cases
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Daily Cases", "Daily Deaths"))
fig.add_scatter(x=total_cases.index, y=total_cases['cases'], line=dict(color='blue', width=2), row=1, col=1)
fig.add_scatter(x=total_cases.index, y=total_cases['deaths'], line=dict(color='red', width=2), row=1, col=2)
fig.update_layout(title_text="Coronavirus Cases and Deaths Daily",
title_font_size=30,
xaxis_title="Days",
yaxis_title='''Number of Cases''')
fig.update(layout_showlegend=False)
fig.show()
Apply cumulative sum:
total_cum = total_cases.cumsum(axis = 0)
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Total Cases", "Total Deaths"))
fig.add_scatter(x=total_cum.index, y=total_cum['cases'], line=dict(color='blue', width=2), row=1, col=1)
fig.add_scatter(x=total_cum.index, y=total_cum['deaths'], line=dict(color='red', width=2), row=1, col=2)
fig.update_layout(title_text="Total Coronavirus Cases and Deaths",
title_font_size=30,
xaxis_title="Days",
yaxis_title='''Total Coronavirus Cases''')
fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)
fig.show()
The curve of the Total Coronavirus Cases and Deaths has similar to exponential shape.
Let’s find the real growth factor of the Coronavirus pandemia.
The best method to find the growth factor from empirical daily observations is to use a statistical model called Linear Regression. Linear Regression allows to estimate the best values for $a$ and $w$ in the following formula, given empirical observations for $y$ and $t$:
$$y = a +w*t,$$where $y = \log(n_t)$, $a = \log(n_0)$, $w = \log(b)$.
The last formula contains log of the number of cases instead of the number of infections and log of growth factor instead of growth factor.
Let's apply log transformation to total cases and deaths and then use the statsmodels library to estimate the Linear Regression function:
import numpy as np
total_cum['log_cases'] = np.log(total_cum['cases']+1)
total_cum['log_deaths'] = np.log(total_cum['deaths']+1)
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Total Cases (Logarithmic Scale)", "Total Deaths (Logarithmic Scale)"))
fig.add_scatter(x=total_cum.index, y=total_cum['log_cases'], line=dict(color='blue', width=2), row=1, col=1)
fig.add_scatter(x=total_cum.index, y=total_cum['log_deaths'], line=dict(color='red', width=2), row=1, col=2)
fig.update_layout(title_text="Coronavirus Cases and Deaths",
title_font_size=30,
xaxis_title="Days",
yaxis_title='''Total Coronavirus Cases''')
fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)
fig.show()
d = total_cum.reset_index()
import statsmodels.api as sm
X = d.index
X = sm.add_constant(X)
y = d.log_cases
mod = sm.OLS(y[60:], X[60:])
res = mod.fit()
print(res.summary())
Based on adjusted R-squared and p-value we could make a conlusion that the model fits the data really well. The statsmodels table gives the values for $a$ and $w$ under coef (in the middle):
Thus, the linear regression function is $$y = 6.1382 + 0.0804*t$$
Going back to the original formula for exponential growth, we get the actual formula for the coronavirus pandemic: $$n_t = 463 *1.08^t$$
def linear_predictions1(t):
return np.exp(6.1382) * (np.exp(0.0804) ** t)
#d = d.reset_index()
d['index'] = d.index
d['cases_predictions'] = d['index'].apply(linear_predictions1)
df_melt = d.melt(id_vars='index', value_vars=['cases', 'cases_predictions'])
fig = px.line(df_melt, x='index' , y='value' , color='variable')
fig.update_layout(title_text="Coronavirus Cases and Cases Projection Assuming Exponential Growth",
title_font_size=24,
xaxis_title="Days since 2020-01-01",
yaxis_title='''Total Coronavirus Cases''')
fig.show()
Similarly obtain the actual formula for the CoVID-19 deaths data: $$n_t = 6.21 *1.1^t$$
import statsmodels.api as sm
X = d.index
X = sm.add_constant(X)
y = d.log_deaths
mod = sm.OLS(y[60:], X[60:])
res = mod.fit()
print(res.summary())
def linear_predictions(t):
return np.exp(1.8264) * (np.exp(0.0947) ** t)
d['deaths_predictions'] = d['index'].apply(linear_predictions)
df_melt = d.melt(id_vars='index', value_vars=['deaths', 'deaths_predictions'])
fig = px.line(df_melt, x='index' , y='value' , color='variable')
fig.update_layout(title_text="Coronavirus Deaths and Deaths Projection Assuming Exponential Growth",
title_font_size=24,
xaxis_title="Days since 2020-01-01",
yaxis_title='''Total Coronavirus Deaths''')
fig.show()
Despite the fact, that exponential curve fitted observed data almost perfectly, Exponential based infected population projection are subject to high risk due to daily changing situation. The repeated projection at 3-7 days are necessary to control the curve.
Only in China the first wave of pandemic has reached to its end. Rest of the countries are still active areas and time of its peak, infected population size will depend on effectiveness of community supported national action policies such that social distancing, health system capacity, access to food and medicine and others.
Using the function that we have estimated using the Exponential Growth curve, let's predict infected population size in a week after the last day of the dataset 2020-04-09:
x = [i for i in range(108)]
y = linear_predictions1(x)
y1 = linear_predictions(x)
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Total Cases", "Total Deaths"))
fig.add_trace(go.Scatter(x=x, y=y), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=y1), row=1, col=2)
fig.update_layout(title_text="CoVID-19 Projection and 1 Week Prediction Assuming Exponential Growth",
title_font_size=24,
xaxis_title="Days since 2020-01-01",
yaxis_title='''Total Cases and Deaths''')
fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)
fig.show()
Now the question arise: is exponential growth describes the observed data best of all?
The obvious change point in infected population is 22nd of February - the date when the first CoVid-19 was registered in Italy. The pandemic curve starting from this stage may be estimated by polynomial curve of the 3rd order even better: $$n_t = a_0 + a_1*t+a_2*t^2 + a_3*t^3$$
d1 = total_cum[53:].reset_index().reset_index()
d1.head()
d1['t'] = d1['index']
d1['t^2'] = d1['t'].apply(lambda x: x**2)
d1['t^3'] = d1['t'].apply(lambda x: x**3)
d1['t^4'] = d1['t'].apply(lambda x: x**4)
X = d1[['t','t^2','t^3']]
X = sm.add_constant(X)
y = d1.cases
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
def polinomial_predictions(t):
return np.asarray(83590) + np.asarray(1861.7047)*t -np.asarray(412.8750)*np.power(t,2) + np.asarray(21.9020)*np.power(t,3)
d1['cases_predictions'] = d1['t'].apply(polinomial_predictions)
df_melt = d1.melt(id_vars='index', value_vars=['cases', 'cases_predictions'])
fig = px.line(df_melt, x='index' , y='value' , color='variable')
fig.update_layout(title_text="Coronavirus Cases and Projection Assuming Polynomial Growth",
title_font_size=24,
xaxis_title="Days since 2020-02-22",
yaxis_title='''Total Coronavirus Cases''')
fig.show()
As we have assumed, polynomial growth fits observed data starting 2020-02-22 much better (R-squared = 0.998). Similarly obtain the formula for the CoVID-19 deaths data assuming polynomial growth and make predictions:
import statsmodels.api as sm
X = d1[['t','t^2','t^3']]
X = sm.add_constant(X)
y = d1.deaths
mod = sm.OLS(y, X)
res = mod.fit()
#print(res.summary())
def polinomial_predictions1(t):
return np.asarray(1165.3676) + np.asarray(729.9956)*t -np.asarray(67.7441)*np.power(t,2) + np.asarray(1.9513)*np.power(t,3)
d1['deaths_predictions'] = d1['t'].apply(polinomial_predictions1)
df_melt = d1.melt(id_vars='index', value_vars=['deaths', 'deaths_predictions'])
fig = px.line(df_melt, x='index' , y='value' , color='variable')
fig.update_layout(title_text="Coronavirus Deaths and Projection Assuming Polynomial Growth",
title_font_size=24,
xaxis_title="Days since 2020-02-22",
yaxis_title='''Total Coronavirus Deaths''')
fig.show()
The Covid-19 virus has affacted lives of everyone. The governments are implementing strict measure of quarantine in order to stop the spread of the virus. In this notebook, I would like to summarize the statistics of the virus progress around the world and to see, whether the actions of social-distancing and isolation are impacting the speed and fatality of contracting. All of the hypothesis will be build around the data availible. However, we shall not forget, that this data might be deviated from the truth and thus, all conclusions derived from this data may not be 100% accurate.
The first data source is a worldwide-aggregated statistics of the number of confirmed Coronavirus cases, number of recovered patients, deathes and increase rate of the virus spread. The datasource is updated daily. The first record was created on the 22nd of January 2020
world_stat = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/master/data/worldwide-aggregated.csv')
world_stat.Date = pd.to_datetime(world_stat.Date)
world_stat.head()
fig = px.line(world_stat)
fig.add_scatter(x=world_stat['Date'], y=world_stat['Confirmed'], line=dict(color='blue', width=2), name='Confirmed cases')
fig.add_scatter(x=world_stat['Date'], y=world_stat['Deaths'], line=dict(color='red', width=4), name='Deaths')
fig.add_scatter(x=world_stat['Date'], y=world_stat['Recovered'], line=dict(color='green', width=3), name='Recovered')
fig.update_layout(title_text="World's Covid-19 Statistics",
title_font_size=30,
xaxis_title="Date",
yaxis_title="Number of citizens")
fig.show()
The plot above illustrates the exponential growth of confirmed coved-19 cases over the course of 4 month. The number of recovered is substainsially higher, than of the number of patients, who passed away. However, there is no decrease (that might have been expected due to governmnet measures) in the virus spread. Maybe, this is the result of higher testing numbers and/or incubation period of the virus. In the future work (if the data availible), this can be explored further.
fig = go.Figure(data=[
go.Bar(name='Recovered', x=world_stat['Date'], y=world_stat['Recovered']),
go.Bar(name='Deaths', x=world_stat['Date'], y=world_stat['Deaths'])])
fig.update_layout(barmode='group',
title_text="World Covid-19 Deaths vs Recovery",
title_font_size=30,
xaxis_title="Date",
yaxis_title="Number of citizens")
fig.show()
This plot again, illustarates the number of recovered vs the number of deathes, where number of recovered grows by a steeper curve. The number of documented from covid-19 deathes on the 6th of Aprile 2020 is over 74k. Comparing to the pandemics, it is on the same level as 2009 swine flu pandemic by the number of human lives taken.
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Deaths to New Cases Ratio", "Recovered to New Cases Ratio"))
fig.add_scatter(x=world_stat['Date'], y=world_stat['Deaths']/world_stat['Confirmed'], line=dict(color='red', width=2), row=1, col=1)
fig.add_scatter(x=world_stat['Date'], y=world_stat['Recovered']/world_stat['Confirmed'], line=dict(color='blue', width=2), row=1, col=2)
fig.show()
The figure_1 shows the ration of deathes to the confirmed cases over time. And figure_2 shows the ratio of recoveries to total registered cases respectively. The death ratio was at its minimum on the 5th of February, with only 2.04% and maxed out on the 6 of April with 5.55% and appeares to be growing. This might be due to lack of hospital beds and ITUs, as the awerness is raised. The recovery ratio peaked on the 8th of March with 55.2%. Since the middle of March the was a recrease in recovery rate, where since 28th of March is plateaus at approximately 20%.
- The mean avarage values are:
- mortality: 3.48%
- recovery: 24.9 %
# a = (world_stat['Recovered']/world_stat['Confirmed']).mean()*100
# (world_stat['Deaths']/world_stat['Confirmed']).mean()
The second datasource comes from kaggle competition (https://www.kaggle.com/c/covid19-global-forecasting-week-3)
This dataframe contains the information about each country's statistic separately, dated from 22nd of January to 5th of April 2020
During this part of descriptive analysis, we can take a closer look at each country's state.
agg_stats = pd.read_csv('train.csv')
agg_stats.Date = pd.to_datetime(agg_stats.Date)
agg_stats.head()
grouped = agg_stats.groupby('Country_Region')[['ConfirmedCases', 'Fatalities', 'Date']].agg(max)
grouped.reset_index(inplace=True)
iso_df = pd.read_csv('plotly_countries_and_codes.csv')
iso_dict = pd.Series(iso_df.CODE.values,index=iso_df.COUNTRY).to_dict()
iso_dict['US'] = 'USA'
grouped['iso'] = grouped['Country_Region'].map(iso_dict)
fig = px.choropleth(grouped, locations="iso",
color="ConfirmedCases",
hover_name="Country_Region",
color_continuous_scale=px.colors.sequential.Plasma)
fig.update_layout(title_text=f"Countries vs Confirmed Cases {agg_stats.Date.max()}")
fig.show()
This map shows the number of convirmed cases by the intensity of color. US, Italy and Spain are the ones of the most affected.
fig = px.choropleth(grouped, locations="iso",
color="Fatalities",
hover_name="Country_Region",
color_continuous_scale=px.colors.sequential.Plasma)
fig.update_layout(title_text=f"Countries: Fatality rates {agg_stats.Date.max()}")
fig.show()
Even though the number of infected in the US in substantially higher, as of April 5th, Italy has lost the highest number of citizens - 15 887
The barcharts below summarize the information about top20 and bottom20 countries by the number of confirmed cases and deaths
fig = make_subplots(
rows=1, cols=2, vertical_spacing=0.6,
subplot_titles=("20 Countries with the most confirmed cases", "20 countries with the least confirmed cases"))
grouped = grouped.sort_values(by='ConfirmedCases', ascending=False)
grouped_top_20 = grouped.iloc[:20]
grouped_bot_20 = grouped.iloc[-20:].sort_values(by='ConfirmedCases')
fig.add_trace(go.Bar(x=grouped_top_20['Country_Region'], y=grouped_top_20['ConfirmedCases'], name='Confirmed_high'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped_bot_20['Country_Region'], y=grouped_bot_20['ConfirmedCases'], name='Confirmed_low'), row=1, col=2)
fig.update_xaxes(tickangle=65)
fig.update_layout(title=f'{agg_stats.Date.max()}')
fig.show()
fig = make_subplots(
rows=1, cols=2,
subplot_titles=('20 Countries with the highest Death Rate', '20 countries with the lowest (or == 1) death rate'))
grouped = grouped.sort_values(by='Fatalities', ascending=False)
grouped_top_20 = grouped.iloc[:20]
grouped_bot_20 = grouped[grouped.Fatalities>0].iloc[-20:].sort_values(by='Fatalities')
fig.add_trace(go.Bar(x=grouped_top_20['Country_Region'], y=grouped_top_20['Fatalities'], name='Mortality_high'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped_bot_20['Country_Region'], y=grouped_bot_20['Fatalities'], name='Mortality_low'), row=1, col=2)
fig.update_xaxes(tickangle=65)
fig.update_layout(title=f'{agg_stats.Date.max()}')
fig.show()
List of Countries, where no registed deaths from covid-19 occured
grouped[grouped.Fatalities==0][['Country_Region']].reset_index(drop=True)
fig = go.Figure(data=[
go.Bar(name='Recovered', x=grouped['Country_Region'], y=grouped['ConfirmedCases'])
])
fig.update_layout(barmode='group',
title_text=f"Countries vs Confirmed Cases Total, {agg_stats.Date.max()}",
title_font_size=30,
xaxis_title="Country",
yaxis_title="Confirmed Cases Total", xaxis_tickangle=-45)
fig.show()
Now, when we have built an understanding around general statistics around coved-19, the time has come to look through the measures, that the governments are undertaking and try to infer their impact on the situation.
The dataset used in this part contains info about government actions from different souces. The last datapoint added is dated March 27th
In this section, I would like to answer the following questions:
gov_measures = pd.read_csv('acaps-covid-19-government-measures-dataset.csv')
na_date = gov_measures[gov_measures.date_implemented.isna()]
gov_measures = gov_measures[~gov_measures.date_implemented.isna()]
lst = list(gov_measures[gov_measures.date_implemented.map(len) == 5]['date_implemented'])
tbd = gov_measures[gov_measures.date_implemented.isin(lst)]
gov_measures = gov_measures[~gov_measures.date_implemented.isin(lst)]
gov_measures['date_implemented'] = pd.to_datetime(gov_measures['date_implemented'])
gov_measures.category = gov_measures.category.replace({'Movement Restriction':'Movement restrictions',
'Movement Restrictions':'Movement restrictions',
'Social and Economic Measures': 'Social and economic measures',
'Social Distancing': 'Social distancing'})
# gov_measures[['country', 'iso', 'measure', 'date_implemented', 'category']].sort_values(by='date_implemented')
measures = pd.DataFrame(gov_measures.measure.value_counts()).reset_index()
measures.columns = ['measure', 'count']
fig = go.Figure(data=[
go.Bar(name='Government Measures', x=measures['measure'], y=measures['count'])
])
fig.update_layout(barmode='group',
title_text="Most common Government Measures to stop spread of Covid-19",
title_font_size=30,
yaxis_title="Number of Governments to undertake", xaxis_tickangle=-40)
fig.show()
Intuitively, we could have guessed, that the most popular government measures are the quarantine policy and travel restrictions. All of these restrictions can be summarized into 4 categories:Public Health, Social & Economic, Social distaning and Movement Restrictions. More detailed breakdown on the graph below.
fig = make_subplots(
rows=1, cols=4,
subplot_titles=(list(gov_measures.category.unique())))
m = list(gov_measures.category.unique())
lay = {0:(1,1),
1:(1,2),
2:(1,3),
3:(1,4)}
for i, category in enumerate(m):
temp = pd.DataFrame(gov_measures[gov_measures.category==category]['measure'].value_counts()).reset_index()
fig.append_trace(go.Bar(x=temp['index'], y=temp['measure']), row=lay[i][0], col=lay[i][1])
fig.update_layout(title_text="Measures by category",
title_font_size=30)
fig.update_xaxes(tickangle=65)
fig.show()
The next chart illustates the cumulative number of policies effective through the course of January-April 2020. The most "awakening" period started after the 1st of March
implementations = pd.DataFrame(gov_measures.date_implemented.value_counts()).reset_index()
implementations.columns = ['date', 'count']
implementations = implementations.sort_values(by='date')
implementations['running_count'] = implementations[['count']].cumsum()
fig = px.line(implementations)
fig.add_scatter(x=implementations['date'], y=implementations['running_count'],mode='lines+markers', name="measures undertaken")
fig.update_layout(title_text="# Government Restrictions Effeactive arounf the World",
title_font_size=30,
xaxis_title="Date",
yaxis_title="Number of restriction")
fig.show()
The chart of number of implemented restrictions reminds of the exponensial growth of confirmed cases around the world
g_m = gov_measures.groupby(['country', 'measure'])['date_implemented'].agg('max').reset_index()
g_m_quarantine = g_m[(g_m.measure == 'Introduction of quarantine policies')].sort_values(by='date_implemented')
fig = px.scatter(x=g_m_quarantine['country'],
y=g_m_quarantine['date_implemented'])
fig.update_layout(xaxis_type='category',
title_text='Timeline of countries starting Quarantine',
xaxis_title="Country",
yaxis_title="Date effective")
fig.update_xaxes(tickangle=65)
fig.show()
This scatterplot shows, which countries were the first and the last ones to react to the virus outbreak. We will take a closer look at the "first" countries - Afganistan and Germany, who introduced the first policies on the 14th and 16th of February respectively. As well as some the last ones - Cuba and Georgia, who reacted after the 20th of March.
def measures_vs_country(c_of_int, i):
country_ = pd.merge(agg_stats[agg_stats.Country_Region==c_of_int][['Country_Region', 'Date', 'ConfirmedCases','Fatalities']],
gov_measures[gov_measures.country==c_of_int][['country', 'measure', 'date_implemented', 'category']],
left_on=['Date', 'Country_Region'], right_on=['date_implemented', 'country'], how='left')[['Country_Region', 'Date', 'ConfirmedCases', 'Fatalities', 'measure', 'category']]
fig = px.line(country_)
fig.add_scatter(x=country_['Date'], y=country_['ConfirmedCases'],mode='lines', name="ConfirmedCases")
fig.add_scatter(x=country_['Date'], y=country_['Fatalities'],mode='lines', name="Fatalities")
work = gov_measures[gov_measures.country== c_of_int].sort_values(by='date_implemented')
fig.add_scatter(x=work['date_implemented'], y=[_*i for _ in range(1, len(work)+1)],mode='markers+text', text=work['measure'], name="Measures", hovertext=work['measure'])
fig.update_layout(title=f'Number of Cases in {c_of_int} and dates of measures undertaken')
fig.show()
measures_vs_country('Afghanistan', 20)
measures_vs_country('Germany', 5000)
Both countries - Germany and Afghanistan declared State of the Emergency, when there were 0 confirmed cases of covid. The populations of countries in 35.5 M - Afghanistan and 83 M Germany. There is no reliable info about number of tests conducted in Afghanistan. Germany - (based on: https://www.worldometers.info/coronavirus/) 918,460 test as of April 7th.
In Afghanistan the first case registered on 24th of February, but the total quarantine was introduced sooner - on the 12th of February.
Germany quarantine policies was introduced on the 14th of February. General Lockdown was anounced on the 21st of March when number of affected reaced 22K and number of Fatalities - 84.
measures_vs_country('Cuba', 20)
It is clear from the graph, that Cuba reacted to late. By the time citizens were quarantined (March 24th), there were already 48 infected (considering the population of 11M). The growth of confirmed cases is a steep exponential curve.
measures_vs_country('Georgia', 20)
Similarly, Georgia started taking action only after the official number of cases was above 30.
Next, we will take a look at which countries managed to react to the virus before its spread
into_to_q = pd.merge(agg_stats[['Country_Region', 'Date', 'ConfirmedCases','Fatalities']],
gov_measures[['country', 'measure', 'date_implemented', 'category']],
left_on=['Date', 'Country_Region'], right_on=['date_implemented', 'country'], how='left')[['Country_Region', 'Date', 'ConfirmedCases', 'Fatalities', 'measure', 'category']]
t = into_to_q.groupby(['Country_Region', 'measure'])['Date'].agg(max).reset_index()
t = t[t.measure == 'Introduction of quarantine policies']
def f(date, country):
try:
return True if t[t.Country_Region==country]['Date'].values[0] <= date else False
except IndexError:
return 'unknown'
into_to_q['quarantine_affective'] = into_to_q.apply(lambda x: f(x['Date'], x['Country_Region']), axis=1)
# discarding countries, that we have no ongo about
into_to_q = into_to_q[~(into_to_q['quarantine_affective']=='unknown')]
List of countries, that introduced quarantine before the first confirmed case
con_before_quar = list(set(into_to_q[(into_to_q.quarantine_affective==True) & (into_to_q.ConfirmedCases == 0)]['Country_Region']))
pd.DataFrame(con_before_quar, columns=['country'])
Next, I'd like to comapare the mortality per 1 milion of the above countries, as well as countries that began quarantine later. I will load the population dataset from https://www.kaggle.com/tanuprabhu/population-by-country-2020
there is availible information about the med. age of the citizens, which we might use later.
population = pd.read_csv('population_by_country_2020.csv')[['Country (or dependency)', 'Population (2020)', 'Med. Age']]
stats_population = pd.merge(grouped, population, left_on=['Country_Region'], right_on=['Country (or dependency)'])
stats_population['death_per_1M'] = (stats_population['Fatalities']/stats_population['Population (2020)'])*1000000
stats_population['infected_per_1M'] = (stats_population['ConfirmedCases']/stats_population['Population (2020)'])*1000000
Based on the information, availible to us, we can compare mean values of deaths per 1M and mean value of infected per 1M between sets of countries who introduced policies before the first case, and the ones after.
print("Mean number of deaths per 1M in countries starting quarantin early: ", stats_population[stats_population.Country_Region.isin(con_before_quar)]['death_per_1M'].mean())
print("Mean number of cases per 1M in countries starting quarantin early: ", stats_population[stats_population.Country_Region.isin(con_before_quar)]['infected_per_1M'].mean())
print()
print("Mean number of deaths per 1M in countries starting quarantin late: ", stats_population[(stats_population.Country_Region.isin(into_to_q.Country_Region)) & (~stats_population.Country_Region.isin(con_before_quar))]['death_per_1M'].mean())
print("Mean number of cases per 1M in countries starting quarantin late: ", stats_population[(stats_population.Country_Region.isin(into_to_q.Country_Region)) & (~stats_population.Country_Region.isin(con_before_quar))]['infected_per_1M'].mean())
How does the median age of the citizens affect mortality?
stats_population = stats_population[~(stats_population['Med. Age']=='N.A.')]
stats_population['Med. Age']=stats_population['Med. Age'].map(int)
age_dist = stats_population.groupby('Med. Age').agg({'Country_Region':lambda x: list(x), 'death_per_1M':'mean'}).reset_index()
fig = go.Figure(data=[
go.Bar(name='Avarage Number of deaths per 1M based on median age of citizens', x=age_dist['Med. Age'], y=age_dist['death_per_1M'], hoverlabel=None, text=age_dist['Country_Region'])
])
fig.update_layout(xaxis_type='category',
barmode='group',
title_text="Avarage Number of deaths per 1M based on median age of citizens",
title_font_size=30,
yaxis_title="Median Age", xaxis_tickangle=-40)
fig.show()
Based on this plot, the conclusion can be made, that countries with older population have greater number of deathes. Thus, older people are at higher risk of dying due to coronavirus.
The countries and their median age distribution can be seen on the plot below.
stats_population = stats_population.sort_values(by='Med. Age', ascending=False)
fig = go.Figure(data=[
go.Bar(name='Median age of citizens', x=stats_population['Country_Region'], y=stats_population['Med. Age'], hoverlabel=None)
])
fig.update_layout(xaxis_type='category',
barmode='group',
title_text="Median age of citizens",
title_font_size=30,
yaxis_title="Median Age", xaxis_tickangle=-40)
fig.show()
I am Ukrainian, and I am curios about the situation in my homeland. The first case was documented on the 3rd of March. School closed on the 9th, national quarantine started on the 11th (when number was still 1). This 8 day delay in reaction did not help to slow down the exponential growth of infected.
measures_vs_country('Ukraine', 100)
print("Number of confirmed cases per 1M in Ukraine: ", stats_population[stats_population.Country_Region=='Ukraine']['infected_per_1M'].values[0])
print()
print("Number of deaths per 1M in Ukraine: ", stats_population[stats_population.Country_Region=='Ukraine']['death_per_1M'].values[0])
print()
print('Mortality in Ukraine: ', round(stats_population[stats_population.Country_Region=='Ukraine']['Fatalities'].values[0]/stats_population[stats_population.Country_Region=='Ukraine']['ConfirmedCases'].values[0]*100, 2),' %')
Cabinet of ministers of Ukraine provides the data about hospitals, doctors' availibility and numbers of artificial breathing ventilators (https://covid19.gov.ua/vidkryti-dani).
Based on the info about number of new patients, what percentage of people will be able to recieve medical aid?
doctors = pd.read_csv('covid19_likari.csv', sep=';')
doctors['Дата'] = pd.to_datetime(doctors['Дата'])
doctors['Внесено'] = pd.to_datetime(doctors['Внесено']).dt.date
doctors['Внесено'] = pd.to_datetime(doctors['Внесено'])
doctors = doctors[(doctors['Внесено'] == '2020-08-04')]
doctors.head()
fig = make_subplots(
rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=("Availibility of doctors", 'Availibility of nurses'))
labels = ['Availible','Already working with covid-19 patients']
total = doctors['Всі лікарі, які можуть бути задіяні з COVID-19 (усі спеціальності)'].sum()
involved_with_patients = doctors['Лікарі, які вже працюють з пацієнтами з COVID-19'].sum()
availible = total-involved_with_patients
values = [availible, involved_with_patients]
fig.append_trace(go.Pie(labels=labels, values=values, hole=.5), row=1, col=1)
total = doctors['Всі медсестри'].sum()
involved_with_patients = doctors['Медсестри, які вже працюють з пацієнтами з COVID-19'].sum()
availible = total-involved_with_patients
values = [availible, involved_with_patients]
fig.append_trace(go.Pie(labels=labels, values=values, hole=.5), row=1, col=2)
fig.update_layout(title='Medical staff (8.04.2020)', title_font_size=30)
fig.show()
Based on the most recent availible data (8th of April), 730 doctors and 1287 nurses are already occupied with covid-19 patients. As of 8th of April there were 1668 confirmed cases, 35 healed patients and 52 letal cases. This is 1518 patients, where only part of them requires hospitalization. In case of more infected - Ukraine has enough professionals. Next, we will take a look at availibility of hospital spaces and ICUs
labels = ['High Class Ventilators', 'Mid Class Ventilators', 'Kids Ventilators']
high_class = doctors['Справних ШВЛ високого класу'].sum()
mid_class = doctors['Справних ШВЛ середнього класу'].sum()
# portable = doctors['Справних портативних (транспортних) ШВЛ '].sum()
kids = doctors['Справних ШВЛ дитячих (0-6 років)'].sum()
outer_values = [high_class,mid_class,kids]
high_class_used = doctors['ШВЛ високого класу підключено до пацієнтів'].sum()
high_class_avbl = high_class - high_class_used
mid_class_used = doctors['ШВЛ середнього класу підключено до пацієнтів'].sum()
mid_class_avlb = mid_class - mid_class_used
kids_used = doctors['Пацієнтів підключених до ШВЛ дитячих'].sum()
kids_avlb = kids - kids_used
inner_values = [high_class_used, high_class_avbl, mid_class_used, mid_class_avlb,
kids_used, kids_avlb]
labels2 = ['used_', 'availible_', 'used', 'availible', '_used', '_availible']
common_props = dict(labels=labels,
values=values,)
trace1 = go.Pie(
hole=0.5,
sort=False,
direction='clockwise',
domain={'x': [0.15, 0.85], 'y': [0.15, 0.85]},
values=inner_values,
textinfo='label',
textposition='inside',
labels=labels2,
marker={'line': {'color': 'white', 'width': 1}}
)
trace2 = go.Pie(
hole=0.7,
sort=False,
direction='clockwise',
values=outer_values,
labels=labels,
textinfo='label',
textposition='outside',
marker={'colors': ['green', 'red', 'blue'],
'line': {'color': 'white', 'width': 1}}
)
fig = go.FigureWidget(data=[trace1, trace2])
fig.update_layout(title='Availibility of Artificial Breathing Ventilators', title_font_size=30,
showlegend=False)
fig.show()
beds_total = doctors['Ліжок загалом'].sum()
occupied = doctors['Зайнято загалом ліжок'].sum()
occupied_cv = doctors['Зайнято хворими з COVID-19'].sum()
occupied_others = occupied - occupied_cv
avlb_beds = beds_total - occupied
labels = ['Availible Hospital Beds','Occupied by covid-19 patients', 'Occupied by other patients']
beds_total = doctors['Ліжок загалом'].sum()
occupied = doctors['Зайнято загалом ліжок'].sum()
occupied_cv = doctors['Зайнято хворими з COVID-19'].sum()
occupied_others = occupied - occupied_cv
avlb_beds = beds_total - occupied
values = [avlb_beds, occupied_cv, occupied_others]
fig= go.Figure(data=[(go.Pie(labels=labels, values=values, hole=.5))])
fig.update_layout(title='Hospital beds availibility (8.04.2020)', title_font_size=30)
fig.show()
Concluding from the data availible - Ukraine has the potential to fight with the virus and to save lives of the citizens. Out of 1668 (8.04.20) confirmed cases only 748 are currently hospitalized, it is 44%. With over 39K availible hospital places left, there should be no shortage of spots. However, for the ICUs - we don't know exactly whether they are in use by covid or other patients, so it is hard to estimate their future availibility.
- Governments, who introduced quarantine in their countries before the first covid-19 patient tend to have lower number of infected, as well as lower number of deaths;
- Quarantine and social distancing are indeed making a difference in the spread of virus;
- Countries with older population (over 45 median age) tend to have higher death rate;
- Any predictive analysis for coronavirus will assume, that the situation is static. However, there are a lot of factors that either can not be descibed by data or the data for which is currently unavailible. These factors include, but are not limited to how well do the citizens comply with the government policies, for how much longer will the borders stay closed, what is the spread rate between medical staff and from here -patient-doctor ratio or even potential development of cure for the virus.